This tutorial was created for a 3-day free R introduction workshop organized by the Laforest-Lapointe Lab at Université de Sherbrooke on May 12th-14th.
R is an object-oriented programming language. Essentially, this means that data, values, inputs and outputs are saved as “objects” whose names are defined by the user. Objects can then be used to perform calculations, data manipulation, or visualisations.
Objectives:
RStudio is a program that allows you to write a series of commands as a script, have R execute these commands, visualise the output, and see the objects that you create.
When you open RStudio for the first time, you will have three panels.
Top left: Script
Document where you will write the command lines. To create a script:
File > New File > R Script. Save regularly!
To execute commands, place your cursor on the command's line and press Command + Enter on Mac or Control + Enter on Windows.
Bottom left: Console
Where the output of the command line will appear. When you execute commands from the script, the command and its output (or error messages!) will appear there.
Use it to execute commands that you do not want to save in the script, or to test your commands before adding them to the script. To execute a command from the console, simply press Enter after writing it.
Top right: Environment
Place where all the objects you create (dataframes, values, vectors) are shown. Click on any object to visualise it.
Bottom right: Files, Plots, Help
This is a files and plots viewer as well as the help center.
The very simplest thing you can do is use R as a calculator and get the output from the console:
2 + 4 # for example
## [1] 6
But we like to do more exciting things. Generally, we save values in objects and then execute operations on those objects.
Most R operations are written like this:
an_object <- a_value
This stores a value in an object named an_object. The value can be, for example, a single number, a character string, or a complex object like a matrix or a list. It can be user-written (“hardcoded”) like above, or be the output of a function, such as:
an_object <- a_function(values, some_parameters)
In this case, the function takes values as an input, applies operations to it, and spits out new values. Parameters are generally optional ways to control the behaviour of the function. Concrete examples will be presented in the next few sections.
For example:
an_object <- sum(2, 4) # sum() takes the values as input and returns their sum
an_object
## [1] 6
an_object * 2 # the object can then be used in further operations
## [1] 12
R uses many characters that have a specific role. Some are essential to know, which we present in this section.
$ access an element of an object by its name (e.g. the column of a data frame)
[] access an element of an object by its index (e.g. vector[3] accesses the 3rd element in that vector)
<- assign an object (shortcut: Alt + - on Windows, Option + - on Mac)
= assign an object; specify function parameters
"" and '' are used to express strings (character values)
? used to learn how a function works. ?sum will open a help file where you can learn about how to use that function.
To compare values to one another:
== is equal to
!= is not equal to
The following can only be used for numeric values:
< smaller than
> greater than
<= smaller or equal to
>= greater or equal to
3 == 4 # for example
## [1] FALSE
3 <= 4
## [1] TRUE
TRUE and FALSE are boolean values with a
specific meaning. They represent a type of information that can only be
true or false.
TRUE == FALSE
## [1] FALSE
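Booleans become especially useful when combined. As a small made-up illustration (these object names are invented for the example), the logical operators & (AND), | (OR) and ! (NOT) operate on them:

```r
is_control <- TRUE
has_growth <- FALSE

is_control & has_growth # AND: TRUE only if both are TRUE
is_control | has_growth # OR: TRUE if at least one is TRUE
!is_control             # NOT: flips the value
```

These operators come back constantly when filtering rows of a data frame.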
Objects and object components (e.g., data frame variables) always have a class. This class gives you a hint on how it will behave.
There exist many object classes, but a few are much more common than
others. In the above example with str(), we can see two
other classes, namely num for numeric vectors and
factor, which is a special (composite) class for
categorical data. Other common classes are character (chr),
integer (int), and logical (logi).
We can also see a composite class, data.frame. These are used for more complex objects, usually with two dimensions. Another composite class is matrix, which is similar to a data frame (the difference is that data frames can contain columns of various classes, while matrices cannot). There are also list and array, which won't be covered in this course.
Character
Numeric
Integer
Factor
Logical
Date
Vector
Data frame
Matrix
Let’s create a few objects and look at the class R will assign them
by default. There are a few basic functions worth learning about at this
point. Functions are little programs that do specific things and you’ll
be using them all the time. The most basic function is c()
which is used to combine its arguments into a vector. If all arguments
are numeric, the vector will be numeric. If they contain at least one
string element (wrapped in "quotes"), it will be a
character vector.
# Create a vector of numeric values
vec1 <- c(1,3,5)
# Create a vector of characters values
vec2 <- c("one","three","five")
# Create a vector of logical values
vec3 <- c(TRUE, TRUE, FALSE)
# Bind these vectors into a dataframe
df1 <- data.frame(vec1, vec2, vec3)
class(vec1) # Use function class( ) to get the class of an object
## [1] "numeric"
class(vec2)
## [1] "character"
vec2_factor <- as.factor(vec2) # convert the character vector to a factor
vec2_factor
## [1] one three five
## Levels: five one three
class(vec2_factor)
## [1] "factor"
class(df1)
## [1] "data.frame"
str(df1) # str() shows the structure of an object
## 'data.frame': 3 obs. of 3 variables:
## $ vec1: num 1 3 5
## $ vec2: chr "one" "three" "five"
## $ vec3: logi TRUE TRUE FALSE
# Save object
saveRDS(df1, "df1.RDS")
# Read/open object
df1_new <- readRDS("df1.RDS")
This is very useful when you need to share R objects as is with collaborators. It’s also useful when you work with large datasets and want to save time. For example, you could have a first script dedicated to producing a certain output, which you save as an RDS object. Then, in a second script, this data can be taken as input and be worked on. This way, when working on your second script, you don’t need to regenerate this large input dataset every time you test the code on that second script.
Create the following objects with 4 made-up values each:
sample_id: a character vector
CFU: a numeric vector
Control: a boolean vector telling us which samples are controls
Date: a vector of dates with format "YYYY-MM-DD"
Then, bind them into a data frame and, using str(), check the class of your data frame columns.
Now let's import a small dataset and familiarise ourselves with a few functions that will help us look at our data in more detail.
It’s good to have a working directory specifically dedicated to your project, where you store your data, scripts, output, etc. This way you can organise what goes in and comes out of R and keep track of your various scripts.
R can look anywhere in our computer for files. However, to load a
file, you need to know its full path,
e.g. '/Users/jorondo/Repos/RworkshopUdS_2025/data.txt'
which can be annoying to write out. Instead, it’s useful to set your
working directory once at the beginning of your script, and then use
relative paths to find your file. To do this, start by creating a folder
in your computer, from where to work:
setwd('~/RWorkshop2025')
dir.create('output')
dir.create('data')
The ~ is a shortcut to your Home folder. You can always
create your folders by hand, but it helps to have the path in your
script. Now, if we save an object we previously created,
saveRDS(my_first_data_frame, "output/my_first_data_frame.RDS")
It will be saved in that directory without needing to specify the absolute path.
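To double-check where R is reading and writing, a few built-in helpers are handy (the folder and file names below follow the example above):

```r
getwd()                    # print the current working directory
list.files("output")       # list files in output/, relative to the working directory
file.exists("output/my_first_data_frame.RDS") # TRUE once the object has been saved there
```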
In any project, a first step should always be to look at your data as
it appears in R. Many functions exist to help you do that.
summary() will summarise variables in a class-dependent manner: numeric variables get a numeric summary (median, range, etc.), factors and logicals get a count per category, and characters get an overview of the first values.
summary(assay)
## CFU Antibiotic OD
## Min. : 4.20 Length:60 Min. :0.500
## 1st Qu.:13.07 Class :character 1st Qu.:0.500
## Median :19.25 Mode :character Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
Other useful functions are head(), which shows the first rows of an object, and View(), which opens the object in a formatted viewer, a bit like an Excel sheet.
head(assay)
## CFU Antibiotic OD
## 1 4.2 streptomycin 0.5
## 2 11.5 streptomycin 0.5
## 3 7.3 streptomycin 0.5
## 4 5.8 streptomycin 0.5
## 5 6.4 streptomycin 0.5
## 6 10.0 streptomycin 0.5
Other useful checks are dim() or length(), especially when playing with functions that modify our objects:
dim(assay) # number of rows and columns
## [1] 60 3
length(assay) # for a data frame, the number of columns
## [1] 3
names(assay)
## [1] "CFU" "Antibiotic" "OD"
length(assay$CFU) # for a single column, the number of values
## [1] 60
R can be used to operate on values in pretty much any way imaginable, from very simple things like calculating the mean to more complex things like removing certain characters from every value in a data frame variable.
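As a taste of those "more complex things", here is a small made-up sketch (not part of the workshop dataset) using gsub(), a base R function that replaces matching text in every element of a vector:

```r
ids <- c("sample_01", "sample_02", "sample_03")
gsub("sample_", "", ids) # remove the "sample_" prefix from every value
# [1] "01" "02" "03"
```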
Most mathematical operations are available as functions:
some_numbers <- c(4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0)
some_numbers
## [1] 4.2 11.5 7.3 5.8 6.4 10.0 11.2 11.2 5.2 7.0
sqrt(some_numbers) # take the square root of every element
## [1] 2.049390 3.391165 2.701851 2.408319 2.529822 3.162278 3.346640 3.346640
## [9] 2.280351 2.645751
a_sum_value <- sum(some_numbers) # Sum the values of a numeric vector
a_length_value <- length(some_numbers) # Count the number of elements in that vector
a_sum_value; a_length_value # print them to the console
## [1] 79.8
## [1] 10
a_sum_value / a_length_value # the mean, computed by hand
## [1] 7.98
mean(some_numbers) # the same, using the built-in function
## [1] 7.98
median(some_numbers)
## [1] 7.15
sd(some_numbers)
## [1] 2.746634
min(some_numbers)
## [1] 4.2
Operations can also be done between columns of a data frame. In the
following, we multiply two columns together and add the resulting vector
as a new column on the same dataset using the <-
(assignment) operator.
assay$CFU_times_OD <- assay$CFU * assay$OD # with $ sign
assay['CFU_times_OD'] <- assay$CFU * assay$OD # equivalent!
assay['log_CFU'] <- log10(assay$CFU)
summary(assay)
## CFU Antibiotic OD CFU_times_OD
## Min. : 4.20 Length:60 Min. :0.500 Min. : 2.100
## 1st Qu.:13.07 Class :character 1st Qu.:0.500 1st Qu.: 6.875
## Median :19.25 Mode :character Median :1.000 Median :19.250
## Mean :18.81 Mean :1.167 Mean :25.746
## 3rd Qu.:25.27 3rd Qu.:2.000 3rd Qu.:46.750
## Max. :33.90 Max. :2.000 Max. :67.800
## log_CFU
## Min. :0.6232
## 1st Qu.:1.1153
## Median :1.2843
## Mean :1.2287
## 3rd Qu.:1.4027
## Max. :1.5302
As you can see, Base R has hundreds of functions available. But
sometimes, we need more. That’s when packages come in handy. Packages
are collections of functions developed by people around the world for
various purposes. These are made available freely to the public, and R
has a function to install these packages. For example, you could run
install.packages('tidyverse'), which we’ll use tomorrow.
Now, whenever you use R, you can gain access to hundreds of new
functions by adding library(tidyverse) at the beginning of
your script.
Often, you’ll need multiple packages for your project. Instead of
writing a series of library() commands at the top of your
script, you can install the pacman package and use its very
useful function, p_load(). This command installs AND loads
any package, and you can load as many packages as you want with a single
command.
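A sketch of what that setup could look like at the top of a script (the package names here are just examples):

```r
install.packages("pacman") # only needed once per computer
library(pacman)
p_load(tidyverse, readxl)  # installs any missing package, then loads them all
```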
Remember: every time you open RStudio, you’ll need to load the packages you need. A good practice is to put the package loading commands at the top of all your scripts.
Download Rocio_tomato_modified.xlsx, available on the Workshop repository or on the Teams documents
Use p_load() to install and load the package readxl, which we'll need for the next tutorial
Use read_xlsx() to read the file into an object
Access the chlorotic.area column and print it
As briefly shown at the beginning, you can also straight up use R as you would a calculator. It's useful for complex arithmetic because you can write your whole equation and then execute it. For example, let's find the area of a single slice of a 14" pizza cut in 8, then format the result to be more "readable":
diameter <- 14
slices <- 8
slice_area <- (pi*(diameter/2)^2)/slices
slice_area_formatted <- round(slice_area, 2)
message('The slice of a 14 inches pizza cut in 8 will have a surface of ', slice_area_formatted, " square inches.")
## The slice of a 14 inches pizza cut in 8 will have a surface of 19.24 square inches.
Notice that we’ve created many objects to do this. We could just as well have done a single-liner that only prints the result to our console:
paste(
'The slice of a 14 inches pizza cut in 8 will have a surface of',
round((pi*(14/2)^2)/8, 2),
"square inches."
) # notice that I can split my command on multiple lines without affecting it!
## [1] "The slice of a 14 inches pizza cut in 8 will have a surface of 19.24 square inches."
But it's up to you to decide how explicit you want your code to be. More explicit code is more readable, but requires more steps to write out. It's all about balance.
You can also use R as a random number generator. The simplest way is using the uniform distribution, runif(), which draws n observations between the specified parameters min and max.
random_numbers <- runif(n = 10, min = 0, max = 100) # for example; your draws will differ
random_numbers
## [1] 97.24192 31.69451 26.69018 34.24109 51.03781 47.47513 17.20659 68.85718
## [9] 85.22271 28.45370
sort(random_numbers) # sort the draws in increasing order
## [1] 17.20659 26.69018 28.45370 31.69451 34.24109 47.47513 51.03781 68.85718
## [9] 85.22271 97.24192
Objectives:
Let's look at a new, imperfect dataset, which is most likely the kind you will be dealing with. We thank workshop participant Maria Rocio Gonzalez-Lamothe for sharing it with us. This data comes from an experiment with tomato plants, which were infected (or not) with a pathogen and either received (or not) a treatment. The data collected include the total leaf area of the plant and the total area affected by chlorosis.
Earlier we used a dataset that was saved in a text file with columns
delimited by tabulations (using the read.table() function).
However, many of you save your data in excel files.
It is important to create objects with explicit and short names that are not already built-in functions in R. For example, do not name an object data, summary, or plot.
Let’s load a new dataset to work with !
library(pacman)
p_load(readxl)
#tomato_data_2025_ILL_test_repeat<-read.csv("Rocio_tomato_modified.csv")
tomato <- read_xlsx('data/Rocio_tomato_modified.xlsx', sheet = 2)
#tomato<-read.csv("data/Rocio_tomato_modified.csv")
Let's look at this dataset.
head(tomato)
## # A tibble: 6 × 5
## rep Treatement Infection total.area chlorotic.area
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 mock water 2663192 83409
## 2 1 mock water 3466704 111165
## 3 1 mock water 3604993 52202
## 4 1 treated water 3605450 86927
## 5 1 treated water 2323251 46141
## 6 1 treated water 2938431 40150
summary(tomato)
## rep Treatement Infection total.area
## Min. :1.00 Length:48 Length:48 Min. :1034441
## 1st Qu.:1.75 Class :character Class :character 1st Qu.:2194426
## Median :2.50 Mode :character Mode :character Median :2856845
## Mean :2.50 Mean :2646678
## 3rd Qu.:3.25 3rd Qu.:3291366
## Max. :4.00 Max. :4389281
## chlorotic.area
## Min. : 14666
## 1st Qu.: 31068
## Median : 54448
## Mean : 81711
## 3rd Qu.: 87013
## Max. :365390
Next, we want to make sure that variables are formatted adequately.
Here, we have variables that are numeric (allowing for mathematical
operations) and some that should be factors (with levels). We will transform some into factors and take a look at summary() to confirm the successful transformation.
tomato$rep<-as.factor(tomato$rep)
tomato$Treatement<-as.factor(tomato$Treatement)
tomato$Infection<-as.factor(tomato$Infection)
summary(tomato)
## rep Treatement Infection total.area chlorotic.area
## 1:12 mock :24 pathogen:23 Min. :1034441 Min. : 14666
## 2:12 treated:24 pathogn : 1 1st Qu.:2194426 1st Qu.: 31068
## 3:12 water :24 Median :2856845 Median : 54448
## 4:12 Mean :2646678 Mean : 81711
## 3rd Qu.:3291366 3rd Qu.: 87013
## Max. :4389281 Max. :365390
We can quickly see that a typo is now present in our data. We could either correct it in the raw data file and reimport a new object, or, to increase reproducibility, correct it in R using the which() function and dimensionality operators.
which(tomato$Infection == "pathogn") # find the position of the typo
## [1] 10
tomato$Infection[10] # inspect that value
## [1] pathogn
## Levels: pathogen pathogn water
tomato$Infection[10]<-"pathogen" #correct the error FOREVER by overwriting
tomato$Infection<-droplevels(tomato$Infection) #erase the empty factor level
summary(tomato) #check the summary again
## rep Treatement Infection total.area chlorotic.area
## 1:12 mock :24 pathogen:24 Min. :1034441 Min. : 14666
## 2:12 treated:24 water :24 1st Qu.:2194426 1st Qu.: 31068
## 3:12 Median :2856845 Median : 54448
## 4:12 Mean :2646678 Mean : 81711
## 3rd Qu.:3291366 3rd Qu.: 87013
## Max. :4389281 Max. :365390
Histograms are a fundamental tool in R for exploratory data analysis (EDA). They help you:
Visualize Distributions: Quickly see the shape of your data (e.g., normal, skewed, bimodal).
Spot Outliers & Gaps: Identify unusual values or missing ranges in your dataset.
Check Assumptions: Assess if data meets statistical assumptions (e.g., normality for t-tests).
Compare Groups: Overlay histograms to compare different subsets (e.g., treatment vs. control).
In R, creating a histogram is simple (e.g., hist(data)), making it an essential first step in understanding your data before diving into analysis.
par(mfrow=c(1,2))# Create a 1 x 2 plotting matrix
hist(tomato$total.area)
hist(tomato$chlorotic.area)
Scatterplots and boxplots are also very common ways to look at the distribution of your variables. Here are a few examples:
par(mfrow=c(2,2))# Create a 2 x 2 plotting matrix
# The next 4 plots created will be plotted next to each other
plot(tomato$total.area~tomato$Treatement)
plot(total.area~Treatement, data=tomato) #this is the same but the XY labels change
plot(total.area~Infection, data=tomato)
plot(chlorotic.area~Infection, data=tomato)
What happens with an interaction?
They allow you to visualise all the data points
You can apply transformation in your plot formula:
par(mfrow=c(1,2))
plot(log(total.area)~Infection,data=tomato)
plot(I(total.area/100000)~Infection,data=tomato)
We can create new variables that combine Treatement and Infection, as well as the ratio between both areas, then use them for plotting:
tomato$combo<-as.factor(paste(tomato$Treatement,tomato$Infection,sep=""))
tomato$rate<-tomato$chlorotic.area/tomato$total.area
tomato$combo<-factor(tomato$combo,levels=levels(tomato$combo)[c(2,1,4,3)]) #here we reorder the levels
plot(rate~combo,data=tomato)
Create a new dataset from tomato but excluding rep #5.
Plot side-by-side the histogram of total.area with and without rep 5, as well as chlorotic.area with and without rep 5.
BONUS: Create a dataset for each rep and plot the ratio chlorotic.area/total.area for the four combined levels (combo).
Loops are very useful, as they help us do similar operations multiple times while writing them only once. Let’s say you want to compute the mean chlorotic area for treated and untreated plants.
mean_Mock <- mean(tomato$chlorotic.area[tomato$Treatement == "mock"])
mean_Treat <- mean(tomato$chlorotic.area[tomato$Treatement == "treated"])
cat("Mean chlorotic area (mock):", mean_Mock, "square mm")
## Mean chlorotic area (mock): 104751.8 square mm
cat("Mean chlorotic area (treated):", mean_Treat, "square mm")
## Mean chlorotic area (treated): 58670.88 square mm
and so on for every variable. Using a loop, it would look like this:
for (i in c('mock', 'treated')) {
  mean_value <- mean(tomato$chlorotic.area[tomato$Treatement == i])
  cat("Mean chlorotic area (", i, "):", mean_value, "square mm\n")
} # the \n character forces a line break
## Mean chlorotic area ( mock ): 104751.8 square mm
## Mean chlorotic area ( treated ): 58670.88 square mm
It may not seem like much of a gain, but imagine if you had many levels to compute. Without a loop, you need two extra lines of code for every level; with a loop, you only need to add values to your vector.
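For example, rather than hardcoding the levels, you can let R discover them with unique(); this sketch reuses the tomato object from above:

```r
for (i in unique(tomato$Treatement)) { # loops over every level present in the data
  mean_value <- mean(tomato$chlorotic.area[tomato$Treatement == i])
  cat("Mean chlorotic area (", i, "):", mean_value, "square mm\n")
}
```

This way the loop keeps working even if new treatment levels are added to the data later.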
Using a loop, compute the mean chlorotic area for each level of Treatement, Infection and rep, and print it to the console using cat().
Objectives:
Learn the tidyverse syntax as an intuitive way to prepare our datasets.
learn to use the ggplot package!
install.packages('tidyverse') # super-package that includes ggplot
When working with large datasets with multiple variables, R can become cumbersome. Imagine having to change 20 character variables to factors, or modifying a recurring typo in a data frame with 2,000 rows… fortunately, it happened to enough people for the community to create new functions to help you write concise, powerful scripts.
Yesterday, we covered functions that are included “by default” in R,
what we call the Base R language. Over time, a large library of
functions was developed, called tidyverse. These functions
let you modify your data in a very intuitive manner.
Let’s learn the tidyverse syntax!
We introduce a modern syntax from the tidyverse
superpackage, which uses the pipe operator %>%. You can
print it by using the shift+cmd+m keyboard combination (or
shift+ctrl+m on Windows).
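The pipe passes the result of the left-hand side as the first argument of the function on the right. A tiny standalone example (assuming tidyverse, or at least magrittr, is loaded; the numbers are arbitrary):

```r
library(magrittr) # provides %>%

# nested base R calls read inside-out
round(sqrt(sum(c(1, 4, 9))), 1)
# [1] 3.7

# the same steps with the pipe read left to right
c(1, 4, 9) %>% sum() %>% sqrt() %>% round(1)
# [1] 3.7
```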
To show its usefulness in making code much easier to read, consider the following challenge: compute the mean proportion of chlorotic area of infected plants that were NOT treated, and format it to print as a percentage with 2 decimals.
A traditional base R code could look like this:
which_observations <- which(tomato$Infection == 'pathogen' & tomato$Treatement == 'mock')
inf_treat_chlor_area <- tomato$chlorotic.area[which_observations]
inf_treat_total_area <- tomato$total.area[which_observations]
prop_area <- 100*inf_treat_chlor_area/inf_treat_total_area
paste('Mean proportion:', round(mean(prop_area),2), "%")
## [1] "Mean proportion: 5.97 %"
Now compare the same thing with the tidyverse syntax. We will use 3 new functions, namely:
mutate() is used to create new variables from existing ones
filter() subsets our rows based on the values of variables
pull() returns a dataset column as a vector
We’ll also introduce the pipe %>% operator, which is
also available in base R as |> (choose your
favourite!).
library(tidyverse)
tomato_subset <- tomato %>%
filter(Infection == 'pathogen' & Treatement == 'mock') %>%
mutate(prop_area = 100*chlorotic.area/total.area)
# Compute the mean
tomato_subset %>%
pull(prop_area) %>%
mean() %>%
round(2) %>%
  paste("Mean proportion:", ., "%")
## [1] "Mean proportion: 5.97 %"
Let's switch to a new dataset with thousands of observations. We'll look at it briefly, then select a few columns of interest using select() and modify variable classes using mutate().
## Warning: Expecting date in L1160 / R1160C12: got 'NA'
## Warning: Coercing text to numeric in Q1411 / R1411C17: '1.5'
## Warning: Coercing text to numeric in R1750 / R1750C18: '91.7'
## Warning: Expecting date in L2453 / R2453C12: got '20;08'
## Rows: 3,774
## Columns: 24
## $ ID <dbl> 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, …
## $ Sex <chr> "M", "M", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ Age <chr> "6", "8", "16", "0", "4", "5", "6", "7", "9", "10", "A…
## $ `Age certainty` <dbl> 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 0, 1, 4, 4, 4, 4, 4, 4, …
## $ Date <chr> "Jul 27 08", "Nov 30 10", "Jul 27 08", "Jul 27 08", "S…
## $ No. <dbl> 1, 2, 1, 1, 2, 3, 4, 5, 6, 7, 1, 1, 1, 2, 3, 4, 5, 6, …
## $ East <chr> "7945", "8011", "8273", "8273", "8301", "8405", "8388"…
## $ North <chr> "8411", "8325", "8362", "8362", "8163", "8337", "8431"…
## $ Method <chr> "Crossbow", "Pole", "Crossbow", "PY", "T-pole", "T-pol…
## $ Year <dbl> 2008, 2010, 2008, 2008, 2012, 2013, 2014, 2015, 2017, …
## $ Day <dbl> 209, 334, 209, 209, 253, 246, 253, 295, 340, 255, 225,…
## $ Time <dttm> NA, NA, NA, NA, NA, NA, 1904-01-01 07:45:00, 1904-01-…
## $ Mass <dbl> 41.00, 46.25, 28.50, NA, 23.25, 26.00, 28.25, 28.50, 2…
## $ Foot <dbl> 355.0, 367.0, 314.0, 174.0, 316.0, 317.0, 322.0, 323.0…
## $ Leg <dbl> 595.0, 613.0, 515.0, 212.0, 502.0, 509.0, 517.0, 528.0…
## $ Arm <dbl> 295, 316, NA, NA, 198, 212, 214, 222, 229, 225, 220, N…
## $ Teeth <dbl> 3.0, 0.5, 3.0, NA, 2.0, 2.0, 1.5, 2.0, 1.5, 1.0, 1.0, …
## $ `Head length` <dbl> NA, NA, NA, 99.4, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ Zygo <chr> NA, NA, NA, NA, "89.7", "88.1", "92.5", "95.2", "95", …
## $ Notes <chr> NA, NA, "mother of #3", "mother is #2", NA, "Mother of…
## $ Measurer <dbl> 2, 1, 2, 2, 1, 1, 1, 8, 9, 1, 2, 2, 2, 1, 1, 3, 1, 1, …
## $ `new date` <dttm> 2008-07-27, 2010-11-30, 2008-07-27, 2008-07-27, 2012-…
## $ `new birthdate` <dttm> NA, NA, NA, 2007-12-20, 2007-12-20, 2007-12-20, 2007-…
## $ `age months` <dbl> NA, NA, NA, 7.236842, 56.743421, 68.552632, 80.789474,…
kangoroos <- kangoroos %>%
select(ID, Sex, Age, Mass, Leg, Date, Teeth) %>%
mutate(ID = as.factor(ID),
Sex = as.factor(Sex),
         Age = as.numeric(Age))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
These warnings are suspicious! We’ll investigate them.
We can take advantage of that syntax to create more informative
summaries of our data by using group_by(), a way to operate
on subsets of our data without actually subsetting it; and passing the
grouped dataframe to summarise(), which allows mathematical
functions to be applied to every group.
## Rows: 3,774
## Columns: 7
## $ ID <fct> 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7…
## $ Sex <fct> M, M, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F…
## $ Age <dbl> 6, 8, 16, 0, 4, 5, 6, 7, 9, 10, NA, 0, 12, 13, 14, 15, 16, 17, 1…
## $ Mass <dbl> 41.00, 46.25, 28.50, NA, 23.25, 26.00, 28.25, 28.50, 29.50, 27.7…
## $ Leg <dbl> 595.0, 613.0, 515.0, 212.0, 502.0, 509.0, 517.0, 528.0, 533.0, 5…
## $ Date <chr> "Jul 27 08", "Nov 30 10", "Jul 27 08", "Jul 27 08", "Sept 9 12",…
## $ Teeth <dbl> 3.0, 0.5, 3.0, NA, 2.0, 2.0, 1.5, 2.0, 1.5, 1.0, 1.0, NA, 2.0, 2…
kang_group_summary <- kangoroos %>%
drop_na(Age, Sex) %>%
group_by(Sex) %>%
summarise(mean_age = mean(Age),
sd_age = sd(Age),
n_obs = n()
)
kang_group_summary
## # A tibble: 2 × 4
## Sex mean_age sd_age n_obs
## <fct> <dbl> <dbl> <int>
## 1 F 4.99 4.55 1900
## 2 M 2.87 3.91 1107
Doing this using base R syntax would have required many intermediate objects, whereas here we get the end product with a single pipe that does all we need.
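For comparison, one base R sketch of the same summary could use aggregate(); this is an illustration, not the workshop's approach, and the resulting column names would still need cleaning:

```r
# keep only rows where both Age and Sex are known
kang_complete <- kangoroos[!is.na(kangoroos$Age) & !is.na(kangoroos$Sex), ]

# one intermediate object per statistic
mean_age <- aggregate(Age ~ Sex, data = kang_complete, FUN = mean)
sd_age   <- aggregate(Age ~ Sex, data = kang_complete, FUN = sd)
n_obs    <- aggregate(Age ~ Sex, data = kang_complete, FUN = length)

# merge the three results into one table, one row per sex
merge(merge(mean_age, sd_age, by = "Sex"), n_obs, by = "Sex")
```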
We can also format this table if we wanted to export it. The
kableExtra package is quite useful for this. We first round the values
to the second decimal using mutate(), then change the
column names and style the table.
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
kang_group_summary %>%
mutate(across(where(is.numeric), ~ round(., 1))) %>%
kable(col.names = c("Sex", "Mean Age", "SD Age", "N")) %>%
  kable_styling("striped")
| Sex | Mean Age | SD Age | N |
|---|---|---|---|
| F | 5.0 | 4.5 | 1900 |
| M | 2.9 | 3.9 | 1107 |
Here’s a fun operator, the assignment pipe %<>%.
It acts as both a pipe and an assignment. It allows you to modify an
object without creating a new object. You need to explicitly load the
magrittr library even if tidyverse is loaded already.
## Warning: package 'magrittr' was built under R version 4.3.3
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
kangoroos %<>%
mutate(Date = case_when(
str_detect(Date, 'Sept') ~ str_replace(Date, 'Sept', 'Sep'),
TRUE ~ Date),
Date = parse_date_time(Date, orders = c('%m%d%y')),
across(where(is.character), as.factor)
)
kangoroos## # A tibble: 3,774 × 7
## ID Sex Age Mass Leg Date Teeth
## <fct> <fct> <dbl> <dbl> <dbl> <dttm> <dbl>
## 1 1 M 6 41 595 2008-07-27 00:00:00 3
## 2 1 M 8 46.2 613 2010-11-30 00:00:00 0.5
## 3 2 F 16 28.5 515 2008-07-27 00:00:00 3
## 4 3 F 0 NA 212 2008-07-27 00:00:00 NA
## 5 3 F 4 23.2 502 2012-09-09 00:00:00 2
## 6 3 F 5 26 509 2013-09-03 00:00:00 2
## 7 3 F 6 28.2 517 2014-09-10 00:00:00 1.5
## 8 3 F 7 28.5 528 2015-10-22 00:00:00 2
## 9 3 F 9 29.5 533 2017-12-06 00:00:00 1.5
## 10 3 F 10 27.8 524 2018-09-12 00:00:00 1
## # ℹ 3,764 more rows
Using the summarise() command, find the mean Leg length
and body mass for each sex at every age.
In this intro to ggplot2 we will look at mapping, layers, and scales.
In ggplot(), you map your variables to what are called
aesthetics and you add on layers to generate a plot that uses those
mappings. You can then add in scales to further manipulate your
plot.
We will map or select the data using aes(). Then we will select the type of plot we want using a geom, for example: geom_histogram(), geom_point(), geom_boxplot(), geom_col(), geom_line(). We will make 4 plots; for each one we will learn a new syntax, slowly adding layers to the plot.
The simplest plot only uses an x or y aesthetic, to count frequencies. For example, histogramming the Mass variable:
kangoroos %>%
  ggplot(aes(x = Mass)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We'll map another variable and add colours using the color = ... argument in geom_point(). We can also select colours using color = ... and fill = ... in geom_boxplot(), and add labels and titles using labs(title = "name", x = "name", y = "name").
kangoroos %>%
drop_na(Sex, Mass) %>%
ggplot(aes(x = Sex, y = Mass)) +
geom_boxplot(color="black", fill = "orangered") +
  labs(title = "Boxplot", x = "Sex", y = "Mass")
Selecting color using aes(fill = ...) and adding labels and titles.
kangoroos %>%
group_by(Age, Sex) %>%
summarise(mean_mass = mean(Mass)) %>%
drop_na() %>%
ggplot(aes(x = Age,
y = mean_mass,
fill = Sex)
) +
geom_col() +
  labs(title = "Mean mass by age", x = "Age", y = "Mass")
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.
By default geom_col() will stack the observations on top of each other for each x axis value. You can place them next to each other using geom_col(position = "dodge").
kangoroos %>%
group_by(Age, Sex) %>%
summarise(mean_mass = mean(Mass)) %>%
drop_na() %>%
ggplot(aes(x = Age,
y = mean_mass,
fill = Sex)
) +
geom_col(position = 'dodge') +
  labs(title = "Mean mass by age", x = "Age", y = "Mass")
## `summarise()` has grouped output by 'Age'. You can override using the `.groups`
## argument.
Say you want to separate your plots by a categorical variable (here, Sex) to look at the effects separately. You can facet your plots using facet_wrap() or facet_grid().
kangoroos %>%
drop_na() %>%
ggplot(aes(x = Leg,
y = Mass,
colour = Age)) +
geom_point() +
facet_wrap(~Sex) +
  labs(title = "Mass and length of leg along age", x = "Leg length", y = "Mass")
We can add all data points on top of a box plot using geom_jitter(). Notice what happens when you place geom_jitter() before vs. after geom_boxplot(). We'll also change the theme here, because that's an option!
kangoroos %>%
drop_na() %>%
ggplot(aes(x = Sex,
y = Mass,
fill = Sex)) +
geom_boxplot() +
geom_jitter(width = 0.2, alpha = 0.5, size = 1) +
labs(x = "Sex", y = "Mass") +
  theme_light()
geom_line() is the function used to make line graphs. Let's put Age on the x axis and Mass on the y axis, drawing one line per individual (group = ID) and colouring by sex. We also add labels using labs().
kangoroos %>%
drop_na() %>%
ggplot(aes(x = Age,
y = Mass,
group = ID,
color = Sex)
) +
geom_line() +
  labs(x = "Age", y = "Mass")
That plot does not look good. Let's change the order of the x axis using scale_x_discrete(). We want R to consider the x values as numeric, so let's tell R that this data should be numeric.
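A hedged sketch of that conversion (assuming Age has ended up as a factor or character and we want a proper numeric axis):

```r
kangoroos %>%
  drop_na(Age, Mass) %>%
  mutate(Age = as.numeric(as.character(Age))) %>% # factor -> character -> numeric avoids level codes
  ggplot(aes(x = Age, y = Mass, group = ID, color = Sex)) +
  geom_line() +
  scale_x_continuous(breaks = seq(0, 20, by = 5)) + # a numeric scale with regular breaks
  labs(x = "Age", y = "Mass")
```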
Make a violin plot using geom_violin()
Facet by the Method variable (e.g. with facet_grid())
Change the colours (scale_fill_manual() function)
Save your plot with the ggsave() function as a 1400 x 1200 pixels file in .png format
## Warning: package 'patchwork' was built under R version 4.3.3
Objectives
By the end of this session, you will:
Understand ggplot2's theme system
Apply built-in and custom themes
Combine multiple plots with patchwork
Create publication-quality figure layouts
You can find all the possible ggplot2 themes here: https://ggplot2.tidyverse.org/reference/ggtheme.html
data <- gapminder %>% filter(year == 2007)
p <- data %>%
ggplot(aes(gdpPercap, lifeExp)) +
geom_point(aes(size = pop, color = continent)) +
scale_x_log10() +
labs(title = "GDP vs Life Expectancy",
x = "GDP per capita",
y = "Life Expectancy")
# Demonstrate theme options
p + theme_dark()
You can greatly modify all aspects of your figures. There are numerous resources online to browse colours in R, but here is one that I like to use: https://colorbrewer2.org/#type=sequential&scheme=Reds&n=3.
p + theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
panel.background = element_rect(fill = "aliceblue"),
legend.position = "bottom",
axis.text = element_text(color = "darkblue")
)
## Facet wraps
Now we will build this boxplot figure step by step to learn how to incrementally construct a ggplot2 figure.
Let’s first reload the tomato (clean) data and relabel the factors to practice our newly learned syntax.
#Load the data
tomato<-read_xlsx("data/Rocio_tomato_modified.xlsx",sheet=2)
#Prepare the relabel function for both variables for clean plotting
relabel.Treatement = as_labeller(c(mock = "Mock",
treated = "Treated"))
relabel.Infection = as_labeller(c(pathogen = "Pathogen",
water = "Water"))
#Transform our dataset with the %<>%
tomato %<>%
rowwise() %>%
mutate(Treatement = factor(relabel.Treatement(Treatement))) %>%
mutate(Infection = factor(relabel.Infection(Infection)))
Okay, let’s start to build our plot!
tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))
Let’s add the facet!
(first_plot <- tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))+
facet_wrap(~Treatement))
Great, now we can change our theme to the dark side!
tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))+
facet_wrap(~Treatement)+
theme_dark()
Let’s add a jitter to see each observation better.
tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))+
facet_wrap(~Treatement)+
theme_dark()+
geom_jitter(width = 0.1,aes(fill=Treatement))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Let’s change the colours, and make the shape of our points fillable.
tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))+
facet_wrap(~Treatement)+
theme_dark()+
geom_jitter(width = 0.1,aes(fill=Treatement))+
scale_fill_manual(values = c("#69B3A2","#de2d26","white","white"))+
scale_shape_manual(values = c(21,21))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Perfect, now on to changing the Y label to something more publication-ready. We can also remove the legend.
(final_plot <- tomato %>%
ggplot(aes(Infection,chlorotic.area/total.area,shape=Infection))+
geom_boxplot(aes(fill=Infection))+
facet_wrap(~Treatement)+
theme_dark()+
geom_jitter(width = 0.1,aes(fill=Treatement))+
scale_fill_manual(values = c("#69B3A2","#de2d26","white","white"))+
scale_shape_manual(values = c(21,21))+
ylab(label="Leaf chlorotic area (%)")+
guides(fill="none",shape="none"))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Create your own version of this plot: change the theme, the colours, the shape of the points, and the X and Y labels, and add a main title.
Patchwork is an amazing package that allows you to create perfect multi-plot figures for your publications.
Here is an overview of the possibilities:
p1 <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p2 <- ggplot(mpg, aes(class)) + geom_bar()
p3 <- ggplot(mpg, aes(cty)) + geom_histogram()
# Horizontal layout
p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Here are some additional options:
# Control spacing and guides
(p1 + p2 + p3) +
plot_layout(guides = "collect") +
plot_annotation(tag_levels = "A")
Using the boxplot figures that we created from the tomato dataset, combine three different plots (the first one and the final one that we created together + the one you modified):
- Combine them into a 1x3 horizontal layout
- Add a shared title “Walk through a tomato plotting contest”
- Label subplots as A, B, C
Using Marco’s dataset, make sure that pole is replaced by Pole and roadkill is replaced by Roadkill (because both are there). You can try the str_to_title() function, which will also take care of RoadKill!
library(magrittr)
library(ggridges)
kang_roo <- read_xlsx('data/Marco_kangoroos.xlsx') %>%
select(ID,
Sex, Age, Mass, Leg, Method, Teeth,
Date) %>%
mutate(ID = as.factor(ID),
Sex = as.factor(Sex),
Age = as.numeric(Age)
)
## Warning: Expecting date in L1160 / R1160C12: got 'NA'
## Warning: Coercing text to numeric in Q1411 / R1411C17: '1.5'
## Warning: Coercing text to numeric in R1750 / R1750C18: '91.7'
## Warning: Expecting date in L2453 / R2453C12: got '20;08'
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Age = as.numeric(Age)`.
## Caused by warning:
## ! NAs introduced by coercion
age_levels <-
kang_roo %>%
pull(Age) %>%
unique()
kang_roo %<>%
mutate(Method = str_to_sentence(Method))
these_methods <- kang_roo %>%
group_by(Method) %>%
summarise(obs.count = n()) %>%
arrange(desc(obs.count)) %>%
pull(Method) %>%
head(3)
kang_roo %>%
filter(Method %in% these_methods)
## # A tibble: 3,581 × 8
## ID Sex Age Mass Leg Method Teeth Date
## <fct> <fct> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 M 8 46.2 613 Pole 0.5 Nov 30 10
## 2 3 F 0 NA 212 Py NA Jul 27 08
## 3 3 F 4 23.2 502 T-pole 2 Sept 9 12
## 4 3 F 5 26 509 T-pole 2 Sept 3 13
## 5 3 F 6 28.2 517 T-pole 1.5 Sept 10 14
## 6 3 F 7 28.5 528 T-pole 2 Oct 22 15
## 7 3 F 9 29.5 533 T-pole 1.5 Dec 6 17
## 8 3 F 10 27.8 524 T-pole 1 Sept 12 18
## 9 4 F NA 28 522 Pole 1 Aug 12 08
## 10 5 F 0 NA 95.4 Py NA Aug 12 08
## # ℹ 3,571 more rows
The functions setwd() and getwd() are your friends.
Navigate in the Files panel (use the blue gear wheel) and make sure that you can see your working directory.
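As a minimal sketch of those two functions (the path below is only an example, not a real workshop path):

```r
# Where is R looking for files right now?
wd <- getwd()   # returns the current working directory as a string
wd
# setwd("~/my_project")   # example path: point R at your project folder
```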
To start, we will load the tomato data and look at it.
tomato<-read_xlsx("data/Rocio_tomato_modified.xlsx",sheet=2)
tomato$Infection<-as.factor(tomato$Infection)
tomato$Infection[which(tomato$Infection=="pathogn")]<-"pathogen"
tomato$Infection<-droplevels(tomato$Infection)
tomato <- tomato %>%
mutate(ratio = chlorotic.area / total.area)
pathogen_only <- tomato %>%
filter(Infection == "pathogen")
What should we do first?
Then we could check for normality:
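The check is a single call to shapiro.test(); here is a self-contained sketch (simulated skewed data stands in for pathogen_only$ratio, whose actual output follows below):

```r
# Shapiro-Wilk normality test on a numeric vector (sketch;
# rexp() gives skewed data, like a ratio bounded at 0)
set.seed(123)
ratio <- rexp(24)
res <- shapiro.test(ratio)
res$p.value   # if p < 0.05, reject H0 that the data are normal
```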
##
## Shapiro-Wilk normality test
##
## data: pathogen_only$ratio
## W = 0.85178, p-value = 0.002359
OH MY GOD a p-value, what does this mean? Because it is <0.05, we can reject our H0 that the data is distributed normally. Thus, our data does not appear to follow a normal distribution.
Here is a different way to perform normality tests on all the variables you want:
variables_to_test <- c("chlorotic.area", "total.area", "ratio")
# Loop over variables and apply Shapiro-Wilk test
shapiro_results <- tomato %>%
select(all_of(variables_to_test)) %>% # Select only the target variables
map(~ shapiro.test(.x)) %>% # Apply Shapiro-Wilk test to each column
map_dfr(~ broom::tidy(.x), .id = "variable") # Convert results to a tidy data frame
# Print results
print(shapiro_results)
## # A tibble: 3 × 4
## variable statistic p.value method
## <chr> <dbl> <dbl> <chr>
## 1 chlorotic.area 0.731 0.0000000494 Shapiro-Wilk normality test
## 2 total.area 0.945 0.0250 Shapiro-Wilk normality test
## 3 ratio 0.731 0.0000000493 Shapiro-Wilk normality test
The second assumption is equal variance in our data; let’s test it:
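The F test below can be produced with var.test(); a sketch on simulated groups (the original presumably ran something like var.test(ratio ~ Treatement, data = pathogen_only)):

```r
# F test to compare two variances (sketch on simulated data)
set.seed(1)
g1 <- rnorm(12, sd = 1)
g2 <- rnorm(12, sd = 1.4)
res <- var.test(g1, g2)   # H0: the ratio of variances is 1
res$p.value               # if p > 0.05, we can assume equal variances
```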
##
## F test to compare two variances
##
## data: ratio by Treatement
## F = 1.9132, num df = 11, denom df = 11, p-value = 0.297
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5507728 6.6459465
## sample estimates:
## ratio of variances
## 1.913219
Okay, so let’s imagine our data is normal, what is the next step, look at our data again:
pathogen_only %>% ggplot(aes(Treatement, ratio, fill=Treatement, shape=Treatement))+
geom_boxplot()+geom_jitter(width=0.1,aes(fill="white"),size=3)+theme_classic()+
scale_shape_manual(values=c(21,21))+
scale_fill_manual(values = c("#69B3A2","#de2d26","white"))+
labs(title="Ratio of chlorotic and total areas impacted by treatment",x="Treatment",y="Ratio")+
guides(fill="none",color="none",shape="none")
In this plot, our H0 is that there is no difference in the mean of the Y variable between the two groups. Our H1 could be that there is a difference (less or more), or we could already have hypothesized that the treated group will have a lower ratio because of literature or our previous knowledge. Thus, we could expect significant results, no? So let’s run a t-test:
t.test(ratio~Treatement,data=pathogen_only) # R defaults to the unequal-variance Welch Two Sample t-test
##
## Welch Two Sample t-test
##
## data: ratio by Treatement
## t = 2.993, df = 20.032, p-value = 0.007179
## alternative hypothesis: true difference in means between group mock and group treated is not equal to 0
## 95 percent confidence interval:
## 0.009977977 0.055854593
## sample estimates:
## mean in group mock mean in group treated
## 0.05972684 0.02681056
t.test(ratio~Treatement,data=pathogen_only, var.equal=T) # default is two-sided
##
## Two Sample t-test
##
## data: ratio by Treatement
## t = 2.993, df = 22, p-value = 0.006702
## alternative hypothesis: true difference in means between group mock and group treated is not equal to 0
## 95 percent confidence interval:
## 0.01010862 0.05572395
## sample estimates:
## mean in group mock mean in group treated
## 0.05972684 0.02681056
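The next two blocks of output are one-sided versions of the same test. Their code chunks were not rendered, so here is a hedged, self-contained sketch of what the calls look like (simulated groups stand in for the tomato data):

```r
# One-sided two-sample t-tests: alternative = "less" / "greater" (sketch)
set.seed(7)
df <- data.frame(y = c(rnorm(12, mean = 5), rnorm(12, mean = 3)),
                 grp = rep(c("mock", "treated"), each = 12))
res_less    <- t.test(y ~ grp, data = df, var.equal = TRUE, alternative = "less")
res_greater <- t.test(y ~ grp, data = df, var.equal = TRUE, alternative = "greater")
res_less$alternative
res_greater$alternative
```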
##
## Two Sample t-test
##
## data: ratio by Treatement
## t = 2.993, df = 22, p-value = 0.9966
## alternative hypothesis: true difference in means between group mock and group treated is less than 0
## 95 percent confidence interval:
## -Inf 0.05180078
## sample estimates:
## mean in group mock mean in group treated
## 0.05972684 0.02681056
##
## Two Sample t-test
##
## data: ratio by Treatement
## t = 2.993, df = 22, p-value = 0.003351
## alternative hypothesis: true difference in means between group mock and group treated is greater than 0
## 95 percent confidence interval:
## 0.01403179 Inf
## sample estimates:
## mean in group mock mean in group treated
## 0.05972684 0.02681056
Let’s now run the non-parametric equivalent, the Mann-Whitney-Wilcoxon Test:
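The call is along the lines of wilcox.test(ratio ~ Treatement, data = pathogen_only); a self-contained sketch:

```r
# Mann-Whitney-Wilcoxon: rank-based, no normality assumption (sketch)
set.seed(2)
df <- data.frame(y = c(runif(12), runif(12) + 0.5),
                 grp = rep(c("mock", "treated"), each = 12))
res <- wilcox.test(y ~ grp, data = df)
res$p.value
```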
##
## Wilcoxon rank sum exact test
##
## data: ratio by Treatement
## W = 124, p-value = 0.00183
## alternative hypothesis: true location shift is not equal to 0
Still significant, so we can reject our H0: it seems like the means of the two groups are not equal. Did we demonstrate H1? No! Statistical tests only let us reject (or fail to reject) H0; they never prove a hypothesis.
- Create a water_only dataset
- Look at the histogram of the ratio variable
- Check for the normality and homoskedasticity of the ratio variable
- Perform a t.test and a Mann-Whitney-Wilcoxon test
Let’s use the Moss data set and plot the Nitrogen fixation (ARA_C2H4) across sampling months in 2017 independently for each moss species.
# Load data
Moss <- read_excel("data/Marie_Mosses.xlsx")
# Transform some variables as factors
Moss$Year <- as.factor(Moss$Year)
Moss$Months <- factor(Moss$Months, levels = c("June","September","October"))
# Subset to keep only data from 2017
Moss2017 <- Moss[Moss$Year=="2017",]
# Subset to keep only data from one moss specie
Moss2017_PCC <- Moss2017[Moss2017$Species == "Ptilium crista-castrensis",]
# Plot it
ggplot(data = Moss2017_PCC, aes(x = Months, y = ARA_C2H4, fill = Months))+
geom_boxplot()+
theme_bw()+
scale_fill_manual(values = c("green","green4", "darkgreen"))
We want to evaluate if there is a significant difference in nitrogen fixation across time. Therefore, we want to compare the means of 3 groups (June, September, October).
Analysis of Variance
First, test the normality of each group (shapiro.test()):
##
## Shapiro-Wilk normality test
##
## data: Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "June"]
## W = 0.89578, p-value = 0.387
##
## Shapiro-Wilk normality test
##
## data: Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "September"]
## W = 0.95196, p-value = 0.7512
##
## Shapiro-Wilk normality test
##
## data: Moss2017_PCC$ARA_C2H4[Moss2017_PCC$Months == "October"]
## W = 0.93141, p-value = 0.606
Then, test the homogeneity of variance (leveneTest()):
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.08 0.3704
## 12
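leveneTest() lives in the car package; conceptually it is just an anova on the absolute deviations from each group’s median, which we can sketch in base R (simulated data, not the Moss values):

```r
# Levene's test "by hand" (center = median), mirroring what
# car::leveneTest() computes
set.seed(1)
g <- factor(rep(c("June", "September", "October"), each = 5))
y <- rnorm(15, mean = rep(c(10, 12, 14), each = 5))
dev <- abs(y - ave(y, g, FUN = median))   # |deviation from group median|
lev <- anova(lm(dev ~ g))
lev$`Pr(>F)`[1]   # same H0 as leveneTest(): variances are homogeneous
```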
# Using anova()
res <- anova(lm(Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months))
res # No difference between groups (Months)
## Analysis of Variance Table
##
## Response: Moss2017_PCC$ARA_C2H4
## Df Sum Sq Mean Sq F value Pr(>F)
## Moss2017_PCC$Months 2 1081.9 540.95 1.6143 0.2394
## Residuals 12 4021.3 335.11
# Using aov()
res2 <- aov(Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months)
summary(res2) # No difference between groups (Months)
## Df Sum Sq Mean Sq F value Pr(>F)
## Moss2017_PCC$Months 2 1082 540.9 1.614 0.239
## Residuals 12 4021 335.1
According to the anova, there is no difference between sampling months.
If the anova had revealed a difference, we would have performed a Tukey test (parametric post-hoc) to know between which groups there is a difference.
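For reference, the post-hoc call is TukeyHSD() on an aov fit; a sketch on simulated data (not the real Moss values):

```r
# Tukey HSD post-hoc: all pairwise month comparisons after an anova (sketch)
set.seed(3)
months <- factor(rep(c("June", "September", "October"), each = 5))
ara <- rnorm(15, mean = rep(c(20, 34, 40), each = 5), sd = 5)
fit <- aov(ara ~ months)
tk <- TukeyHSD(fit)
tk$months   # one row (diff, lwr, upr, p adj) per pair of months
```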
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Moss2017_PCC$ARA_C2H4 ~ Moss2017_PCC$Months)
##
## $`Moss2017_PCC$Months`
## diff lwr upr p adj
## September-June 13.779040 -17.10865 44.66673 0.4809339
## October-June 20.386686 -10.50101 51.27438 0.2239070
## October-September 6.607645 -24.28005 37.49534 0.8379383
Let’s use the flowers data set and plot the alpha-diversity for each site (code).
# Import dataset
flowers <- read_excel("data/meta.flower.2022.xlsx")
# Transform the variable code as factor
flowers$code <- as.factor(flowers$code)
# Plot alpha-diversity for each site (code)
(plot.flower <- ggplot(flowers, aes(code, alphadiv, fill = code))+
geom_boxplot()+
theme_bw())
##
## Shapiro-Wilk normality test
##
## data: flowers$alphadiv[flowers$code == "A"]
## W = 0.96274, p-value = 0.8265
##
## Shapiro-Wilk normality test
##
## data: flowers$alphadiv[flowers$code == "B1"]
## W = 0.74743, p-value = 0.005061
##
## Shapiro-Wilk normality test
##
## data: flowers$alphadiv[flowers$code == "C"]
## W = 0.82122, p-value = 0.03557
Not all groups are normally distributed, so we take the non-parametric route; here is the Dunn test output:
## # A tibble: 3 × 9
## .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
## * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
## 1 alphadiv A B1 9 9 0.564 0.573 0.573 ns
## 2 alphadiv A C 9 9 -3.33 0.000881 0.00176 **
## 3 alphadiv B1 C 9 9 -3.89 0.000100 0.000301 ***
Let’s add the statistical results to the plot.
# Store dunn test result in an object
dunn.res <- dunn_test(flowers, alphadiv~code)
# Manually extract p-values from dunn test results
p_values <- dunn.res$p.adj.signif[c(1, 2, 3)]
# Find x and y positions for the p-value
stat.flower <- add_xy_position(dunn.res, x = "code")
# List of comparison
comparisons <- list(c("A", "B1"), c("A", "C"), c("B1", "C"))
# Add on the plot
plot.flower+
geom_signif(comparisons = comparisons,
annotations = p_values,
y_position = stat.flower$y.position)
To conclude, if we want to compare the means of more than two groups, we first need to know whether to use a parametric (Anova) or a non-parametric (Kruskal-Wallis) test. So we first test the normality of all groups using shapiro.test() (p<0.05: not normal; p>0.05: normal) and the homogeneity of the variance using leveneTest().
Depending on the results, we will perform an Anova or a Kruskal test. If the result of the Anova is significant, we then perform a post-hoc Tukey test. If the result of the Kruskal test is significant, we then perform a post-hoc Dunn test.
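The non-parametric branch uses kruskal.test(); a minimal self-contained sketch (simulated data stands in for the flowers dataset):

```r
# Kruskal-Wallis: non-parametric comparison of three (or more) groups (sketch)
set.seed(4)
df <- data.frame(y = c(runif(9), runif(9) + 1, runif(9) + 2),
                 code = factor(rep(c("A", "B1", "C"), each = 9)))
res <- kruskal.test(y ~ code, data = df)
res$p.value   # if significant, follow up with a post-hoc Dunn test
```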
Pearson = parametric
Spearman = non-parametric
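Both flavours use cor.test(), differing only in the method argument; a self-contained sketch on simulated data:

```r
# cor.test(): method = "pearson" (the default) or "spearman" (sketch)
set.seed(5)
x <- rnorm(30)
y <- -0.5 * x + rnorm(30, sd = 0.5)
res_p <- cor.test(x, y)                        # Pearson, parametric
res_s <- cor.test(x, y, method = "spearman")   # Spearman, rank-based
res_p$estimate   # named "cor"
res_s$estimate   # named "rho"
```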
#### Test normality:
##
## Shapiro-Wilk normality test
##
## data: Moss2017$Fe_ppm
## W = 0.76572, p-value = 1.652e-05
##
## Shapiro-Wilk normality test
##
## data: Moss2017$P_ppm
## W = 0.97159, p-value = 0.5834
##
## Shapiro-Wilk normality test
##
## data: Moss2017$CN_Ratio
## W = 0.96198, p-value = 0.3476
##
## Shapiro-Wilk normality test
##
## data: Moss2017$`N_mg.g-1`
## W = 0.96918, p-value = 0.517
#### Perform correlation test:
##
## Pearson's product-moment correlation
##
## data: Moss2017$P_ppm and Moss2017$CN_Ratio
## t = -2.3966, df = 28, p-value = 0.02347
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.67283141 -0.06143071
## sample estimates:
## cor
## -0.4125691
# p<0.05: Reject H0 -> there is a correlation (r = -0.4)
# Spearman (non-parametric)
plot(Moss2017$Fe_ppm, Moss2017$P_ppm)
##
## Spearman's rank correlation rho
##
## data: Moss2017$Fe_ppm and Moss2017$P_ppm
## S = 6922, p-value = 0.002405
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5399333
# p<0.05: Reject H0 -> there is a correlation (r = -0.5)
# Pearson (parametric)
plot(Moss2017$`N_mg.g-1`, Moss2017$CN_Ratio)
##
## Pearson's product-moment correlation
##
## data: Moss2017$`N_mg.g-1` and Moss2017$CN_Ratio
## t = -23.305, df = 28, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9882499 -0.9479519
## sample estimates:
## cor
## -0.9751798
Test if there is a correlation between two other variables from the moss data set:
- Test normality of both variables
- Perform appropriate correlation test
- Plot it and add correlation coefficient and p-value
For this analysis we want to know how roof age (the explanatory variable) influences the response variable plant cover (Moss, Suc). Let’s start with Moss; we will need to filter out all other cover types.
Check normality with shapiro.test() and get Moss as close to normality as possible by trying transformations (reciprocal, logarithm, cube root, square root, and square).
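One way to compare those transformations is to collect the Shapiro-Wilk p-values side by side; a sketch (simulated right-skewed data stands in for the Moss cover values):

```r
# Compare Shapiro-Wilk p-values across candidate transformations (sketch)
set.seed(42)
x <- rlnorm(20)   # strictly positive, right-skewed stand-in data
transforms <- list(raw = x, reciprocal = 1/x, log10 = log10(x),
                   cube.root = x^(1/3), square.root = sqrt(x), square = x^2)
pvals <- sapply(transforms, function(v) shapiro.test(v)$p.value)
sort(pvals, decreasing = TRUE)   # highest p-value = closest to normality
```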
##
## Shapiro-Wilk normality test
##
## data: log10(Cut$Cover)
## W = 0.94629, p-value = 0.3143
Here moss will be the response variable and age will be the explanatory variable. lm() is the function for simple regression.
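A minimal, self-contained lm() sketch (simulated Age and Cover values stand in for the Cut data used below):

```r
# Simple linear regression with lm(): response ~ explanatory (sketch)
set.seed(6)
Age <- runif(20, min = 5, max = 30)
Cover <- 10^(1.2 + 0.01 * Age + rnorm(20, sd = 0.4))
fit <- lm(log10(Cover) ~ Age)
coef(fit)               # intercept and slope
summary(fit)$r.squared  # proportion of variance explained
```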
##
## Call:
## lm(formula = log10(Cover) ~ Age, data = Cut)
##
## Coefficients:
## (Intercept) Age
## 1.24809 0.01137
This code works, but it only gives us the coefficients. To get more details, let’s save the model in an object and use the summary() function:
##
## Call:
## lm(formula = log10(Cover) ~ Age, data = Cut)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09620 -0.20377 0.02189 0.30202 0.54444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.24809 0.20464 6.099 9.21e-06 ***
## Age 0.01137 0.01372 0.829 0.418
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4522 on 18 degrees of freedom
## Multiple R-squared: 0.03678, Adjusted R-squared: -0.01673
## F-statistic: 0.6873 on 1 and 18 DF, p-value: 0.418